Analysis of contributors to Prosper loan scores
by Charles Dellinger

The data set is about personal loans made through Prosper Marketplace. Prosper anonymously connects private users who want to lend and borrow money. The marketplace allows users to apply for and receive loans from $2,000 to $40,000. During the application process, fairly standard information, like income, occupation, assets, and credit score, is collected. This data set contains that application information along with additional data about the loan itself, including its current state. A full list of the variables recorded for each loan is below:

##  [1] "AmountDelinquent"                   
##  [2] "AvailableBankcardCredit"            
##  [3] "BankcardUtilization"                
##  [4] "BorrowerAPR"                        
##  [5] "BorrowerRate"                       
##  [6] "BorrowerState"                      
##  [7] "ClosedDate"                         
##  [8] "CreditGrade"                        
##  [9] "CreditScoreRangeLower"              
## [10] "CreditScoreRangeUpper"              
## [11] "CurrentCreditLines"                 
## [12] "CurrentDelinquencies"               
## [13] "CurrentlyInGroup"                   
## [14] "DateCreditPulled"                   
## [15] "DebtToIncomeRatio"                  
## [16] "DelinquenciesLast7Years"            
## [17] "EmploymentStatus"                   
## [18] "EmploymentStatusDuration"           
## [19] "EstimatedEffectiveYield"            
## [20] "EstimatedLoss"                      
## [21] "EstimatedReturn"                    
## [22] "FirstRecordedCreditLine"            
## [23] "GroupKey"                           
## [24] "IncomeRange"                        
## [25] "IncomeVerifiable"                   
## [26] "InquiriesLast6Months"               
## [27] "InvestmentFromFriendsAmount"        
## [28] "InvestmentFromFriendsCount"         
## [29] "Investors"                          
## [30] "IsBorrowerHomeowner"                
## [31] "LenderYield"                        
## [32] "ListingCategory..numeric."          
## [33] "ListingCreationDate"                
## [34] "ListingKey"                         
## [35] "ListingNumber"                      
## [36] "LoanCurrentDaysDelinquent"          
## [37] "LoanFirstDefaultedCycleNumber"      
## [38] "LoanKey"                            
## [39] "LoanMonthsSinceOrigination"         
## [40] "LoanNumber"                         
## [41] "LoanOriginalAmount"                 
## [42] "LoanOriginationDate"                
## [43] "LoanOriginationQuarter"             
## [44] "LoanStatus"                         
## [45] "LP_CollectionFees"                  
## [46] "LP_CustomerPayments"                
## [47] "LP_CustomerPrincipalPayments"       
## [48] "LP_GrossPrincipalLoss"              
## [49] "LP_InterestandFees"                 
## [50] "LP_NetPrincipalLoss"                
## [51] "LP_NonPrincipalRecoverypayments"    
## [52] "LP_ServiceFees"                     
## [53] "MemberKey"                          
## [54] "MonthlyLoanPayment"                 
## [55] "Occupation"                         
## [56] "OnTimeProsperPayments"              
## [57] "OpenCreditLines"                    
## [58] "OpenRevolvingAccounts"              
## [59] "OpenRevolvingMonthlyPayment"        
## [60] "PercentFunded"                      
## [61] "ProsperPaymentsLessThanOneMonthLate"
## [62] "ProsperPaymentsOneMonthPlusLate"    
## [63] "ProsperPrincipalBorrowed"           
## [64] "ProsperPrincipalOutstanding"        
## [65] "ProsperRating..Alpha."              
## [66] "ProsperRating..numeric."            
## [67] "ProsperScore"                       
## [68] "PublicRecordsLast10Years"           
## [69] "PublicRecordsLast12Months"          
## [70] "Recommendations"                    
## [71] "RevolvingCreditBalance"             
## [72] "ScorexChangeAtTimeOfListing"        
## [73] "StatedMonthlyIncome"                
## [74] "Term"                               
## [75] "TotalCreditLinespast7years"         
## [76] "TotalInquiries"                     
## [77] "TotalProsperLoans"                  
## [78] "TotalProsperPaymentsBilled"         
## [79] "TotalTrades"                        
## [80] "TradesNeverDelinquent..percentage." 
## [81] "TradesOpenedLast6Months"

Univariate Exploration

Univariate Plots Section

IncomeRange

The histogram of IncomeRange shows that most of the borrowers make between $25,000 and $75,000. However, there’s a decent number of ‘Not displayed’ entries, which are assumed to be borrowers who abstained from reporting their income or from having it reported. The ‘Not displayed’ count is a lot smaller than most of the range counts, so it isn’t likely to cause a misrepresentation if this variable is used.

StatedMonthlyIncome

The stated monthly incomes look to be all bunched up in the low end. Zooming in on the plot will give a better idea of the distribution.

The stated monthly income histogram paints a slightly different picture than the income range bins. The income range bins had more of a bell-curve type shape, whereas the stated income curve looks skewed right.

Will the StatedMonthlyIncome look the same if it is sorted into buckets?

The histogram of StatedMonthlyIncome buckets doesn’t seem to align with the income buckets outlined by IncomeRange. Since almost all of the incomes fall into a single bucket, there seems to be some problem with the StatedMonthlyIncome variable.
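One possible way to bucket a monthly income so that it lines up with IncomeRange is sketched below. It is not necessarily how the buckets above were built: the toy values are made up, and the choice to annualize the income and reuse the IncomeRange label boundaries as break points is an assumption.

```r
# Toy stand-in for the loan data (hypothetical values).
loans <- data.frame(
  StatedMonthlyIncome = c(0, 1500, 2500, 4200, 6100, 7000, 9000, 12500)
)

# Annualize so the values are on the same scale as the IncomeRange labels.
annual_income <- loans$StatedMonthlyIncome * 12

# Left-closed bins: [0, 1), [1, 25000), ..., [100000, Inf).
loans$StatedMonthlyIncome.bucket <- cut(
  annual_income,
  breaks = c(0, 1, 25000, 50000, 75000, 100000, Inf),
  labels = c("$0", "$1-24,999", "$25,000-49,999",
             "$50,000-74,999", "$75,000-99,999", "$100,000+"),
  right = FALSE
)

income_bucket_counts <- table(loans$StatedMonthlyIncome.bucket)
print(income_bucket_counts)
```

If the bucketed counts still pile up in one bin after annualizing, that would point to a problem in the variable itself rather than in the bucketing.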

Occupation

Occupation is going to be pretty useless for categorization as it is, since there are so many distinct values. Even if they were grouped, a category like healthcare workers would lump together very different borrowers, for instance a “Nurse’s Aide” and a “Doctor”. Also, the catch-alls “Other” and “Professional” are ambiguous and seem to dominate. The proportions of Occupation entries are:

##    Professional           Other Everything Else 
##       0.1196100       0.2511651       0.6292249

With about 37% of Occupation entries (“Professional” plus “Other”) too ambiguous to use, a subgroup analysis probably won’t produce generalizable results.
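A table of proportions like the one above can be produced by collapsing every occupation other than the two catch-alls into one group. The snippet is a minimal sketch on made-up occupations, so the printed shares will not match the real data.

```r
# Toy occupation values (hypothetical; the real column has many more levels).
occupation <- c("Professional", "Other", "Other", "Nurse's Aide", "Doctor",
                "Teacher", "Other", "Professional", "Engineer", "Clerical")

# Keep the two catch-all labels and lump the rest together.
occupation_group <- ifelse(occupation %in% c("Professional", "Other"),
                           occupation, "Everything Else")
occupation_group <- factor(
  occupation_group,
  levels = c("Professional", "Other", "Everything Else")
)

# Share of entries in each group.
occupation_share <- prop.table(table(occupation_group))
print(occupation_share)
```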

Debt-to-Income Ratio

Almost all the debt-to-income ratios are less than 1. A borrower whose debt payments exceed their income would presumably be very likely to default.

##   <= 1    > 1 
## 104584    799

Custom bins, with narrower bins at the low end of the ratio, can give a better feel for the data by showing finer detail where most of the borrowers are.
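The split at 1 and the custom bins could be built along these lines. The ratios and the bin edges here are illustrative assumptions, not the ones used for the plot.

```r
# Toy debt-to-income ratios, including one missing value (hypothetical).
dti <- c(0.05, 0.12, 0.18, 0.22, 0.35, 0.48, 0.90, 1.50, 10.01, NA)

# Split at a ratio of 1; table() drops the missing value.
dti_over_one <- table(ifelse(dti > 1, "> 1", "<= 1"))

# Assumed bin edges: narrow bins below 0.5, wide bins above.
dti_bins <- cut(
  dti,
  breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 1, Inf),
  include.lowest = TRUE
)
dti_bin_counts <- table(dti_bins)

print(dti_over_one)
print(dti_bin_counts)
```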

Delinquencies in the Last 7 Years

The users applying for loans had a fair number of instances where payments were late. This metric is a mixed bag for lenders: a late borrower accrues more interest, but also carries a higher risk of default.

Total Trades on the Prosper Platform

In general, the number of trades is skewed to the right, but the curve between 0 and 50 shows that a lot of borrowers are repeat users.

Term Length of Loans

The loans seem to be grouped in three discrete lengths.

Term Lengths
Term N
12 1614
36 87778
60 24545

The table confirms that the data set contains only three loan lengths, which correspond to 1, 3, and 5 years. Term would be a good candidate to convert to a categorical variable for splitting other plots later.
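Converting Term to a categorical variable can be as simple as the sketch below. The toy values and the year labels are assumptions for illustration.

```r
# Toy term lengths in months (hypothetical values).
term <- c(36, 36, 60, 12, 36, 60)

# Fix the level order so plots and tables sort by loan length.
term2 <- factor(
  term,
  levels = c(12, 36, 60),
  labels = c("1 year", "3 years", "5 years")
)

term_counts <- table(term2)
print(term_counts)
```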

Home Ownership

Home ownership would show that a person has an asset, which would make the loan more viable because the borrower can both make payments and has something to borrow against if they need money to meet their liabilities.

This variable might be a good one to use to see split scores or behavior between those that own a home or not.

BankcardUtilization

This variable is strange. It is defined as: “The percentage of available revolving credit that is utilized at the time the credit profile was pulled.” It seems to be the share of available credit that the borrower is using, or how much of their credit card potential they have used.

Bankcard Utilization
Statistic Value
Min. 0.00
1st Qu. 0.31
Median 0.60
Mean 0.56
3rd Qu. 0.84
Max. 5.95
NA’s 7604.00

The shape of the bank card utilization histogram shows a spike at 0% and then an increasing trend as the percentage approaches 100%. The box plot shows that a fair number of people have somehow gone over their bank card limits by a pretty wide margin (the maximum utilization is 5.95, or almost 600%). Bank card utilization would be an interesting metric to test for correlations in the multivariate analysis.

Credit Score Range

The shapes of the upper and lower credit score histograms look roughly the same, so I would expect the ranges to have a standard width. A quick table of the differences will confirm.

## 
##     19 
## 113346

Since the ranges are of fixed width (19 points), either bound can be used to view trends or make predictions.
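The width check amounts to subtracting the two bounds and tabulating the result. This is a minimal sketch on made-up scores; the column names are the ones from the data set.

```r
# Toy credit score ranges, with one missing pair (hypothetical values).
scores <- data.frame(
  CreditScoreRangeLower = c(640, 680, 700, NA, 720),
  CreditScoreRangeUpper = c(659, 699, 719, NA, 739)
)

# A single distinct value means every range has the same width.
range_width <- scores$CreditScoreRangeUpper - scores$CreditScoreRangeLower
width_counts <- table(range_width)
print(width_counts)
```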

Prosper Score

## 
##     1     2     3     4     5     6     7     8     9    10    11 
##   992  5766  7642 12595  9813 12278 10597 12053  6911  4750  1456

The Prosper scores take only discrete values, and for some reason there are scores of 11, despite the variable definition file indicating a maximum of 10. Since Prosper score is made up of discrete levels, it can be transformed into a categorical variable.
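A sketch of counting out-of-range scores and making the categorical version follows. The toy scores are made up, and using an ordered factor is an assumption about how the conversion could be done.

```r
# Toy Prosper scores, including an 11 and a missing value (hypothetical).
prosper_score <- c(1, 4, 4, 6, 8, 10, 11, NA, 6)

# Count scores above the documented maximum of 10.
out_of_range <- sum(prosper_score > 10, na.rm = TRUE)

# Ordered factor so that higher levels still mean lower risk.
prosper_score2 <- factor(
  prosper_score,
  levels = sort(unique(prosper_score)),
  ordered = TRUE
)

print(out_of_range)
print(table(prosper_score2))
```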

Borrower APR

The peak for borrower rates is on par with credit card rates with a roughly even distribution above and below. This variable will probably be good to use as a predictor.

Loan Amount

According to the Prosper website, the loans are from $2,000 to $40,000. Most of the loans are of low value, but there are clear spikes at $5,000 intervals ($10,000, $15,000, $20,000, $25,000), which suggests that people round off the amount they need.

Univariate Analysis

What is the structure of your dataset?

There are 113937 entries, each containing the 81 variables listed above, of different types.

All of the data categories are related to whether the loan would be a good investment or not, which is presumably wrapped into their proprietary Prosper score.

What is/are the main feature(s) of interest in your dataset?

The main features of concern are the variables that are commonly primary contributors to risk, such as assets, liabilities, and current state of leverage. Presumably these will be rolled up and reflected in the Prosper score (“ProsperScore”). The variables are: “IncomeRange”, “DebtToIncomeRatio”, “DelinquenciesLast7Years”, “TotalTrades”, “Term”, “IsBorrowerHomeowner”, “BankcardUtilization”, “BorrowerAPR”, “LoanOriginalAmount”, and “CreditScoreRangeLower”/“CreditScoreRangeUpper”.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Additional features that might build a richer picture would include:
* “ListingCategory..numeric.” - to see what the loan is for
* “LoanStatus” - to check on overdue loans
* “Investors” - how many people have confidence that the borrower will pay?
* “Recommendations” - how many people vouch for the borrower?
* “EmploymentStatusDuration” - is the borrower’s cash flow consistent?
* “EstimatedReturn” - to measure risk/benefit ratio
* “EstimatedLoss” - complement of the return

Did you create any new variables from existing variables in the dataset?

I created four new variables:
* IncomeRange2 - consolidates “Not employed” to have an income of $0
* StatedMonthlyIncome.bucket - bucketed the StatedMonthlyIncome to test against IncomeRange, but the results made it seem like the monthly income wasn’t as reliable
* Term2 - changed Term into a categorical variable
* ProsperScore2 - changed ProsperScore into a categorical variable
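The IncomeRange2 consolidation could look like the sketch below. The toy values are made up, and the level order (with “Not displayed” first) is an assumption.

```r
# Toy IncomeRange values (hypothetical sample of the real labels).
income_range <- c("$0", "Not employed", "$25,000-49,999",
                  "Not displayed", "$1-24,999", "Not employed")

# Treat "Not employed" as an income of $0; leave every other label alone.
income_range2 <- ifelse(income_range == "Not employed", "$0", income_range)

# Order the levels from lowest to highest income so plots sort sensibly.
income_range2 <- factor(
  income_range2,
  levels = c("Not displayed", "$0", "$1-24,999", "$25,000-49,999",
             "$50,000-74,999", "$75,000-99,999", "$100,000+"),
  ordered = TRUE
)

income_range2_counts <- table(income_range2)
print(income_range2_counts)
```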

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

All the distributions were either skewed in the direction that would be expected or were on a normalized scoring scale. The main exceptions were spikes at round-number intervals, because people round off the amounts they expect to need.

There were some strange things within some of the variables. The Prosper score had scores of 11 even though the range was stated to be 1-10. There were bankcard utilizations and debt-to-income ratios that were questionable, because the borrower had much greater debt than they could pay, like the borrower that had almost 600% bankcard utilization and the 799 borrowers that had more debt payments per month than their monthly income.

Bivariate & Multivariate Exploration

In general, the bivariate plots didn’t show anything unless augmented by another variable, so the plot sections have been merged together. Most of the plots are multivariate, but there are a few bivariate where the relationship is straightforward enough.

Bivariate & Multivariate Plots Section

A quick grid of relationships to see if anything sticks out:

Prosper score seems like a good place to start looking at the effects of variables on each other, since it is a sort of summation of risk for the loan. The amount of risk is usually passed on as a loan penalty in the form of APR. Therefore, there should be a relationship between Prosper score and APR such that a higher Prosper score goes with a lower APR and vice versa.

As expected, there’s a relationship between borrower APR and Prosper score. The higher the Prosper score, the less risk, so the APR is lower. The coloring of the visualization doesn’t show Term clearly affecting this relationship, even though longer-term loans are intrinsically riskier.

Since Prosper score is proprietary, it will be interesting to find out what else goes into the Prosper score. Outside of Prosper, APRs generally trend with the non-proprietary credit score, so Prosper score and credit score would be expected to roughly trend together.

Strangely, there doesn’t seem to be much correlation, if any, between Prosper score and credit score despite both being composite measures of risk for financial institutions. Just how weak is the correlation?

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperScore and CreditScoreRangeLower
## t = 115.87, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3637793 0.3753979
## sample estimates:
##      cor 
## 0.369603

Despite the weak correlation of Prosper score and credit score (r ≈ 0.37), the coloring of the visualization does reveal that people with higher income are more likely to have a higher Prosper score, regardless of credit score. For comparison, the correlations of borrower APR with Prosper score and with credit score are:

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperScore and BorrowerAPR
## t = -261.68, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6719940 -0.6645469
## sample estimates:
##        cor 
## -0.6682872
## 
##  Pearson's product-moment correlation
## 
## data:  CreditScoreRangeLower and BorrowerAPR
## t = -160.21, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4344422 -0.4249487
## sample estimates:
##        cor 
## -0.4297073
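Output in the form shown above comes from R’s `cor.test()`. The sketch below runs it on made-up pairs and then recomputes the t statistic from the correlation and degrees of freedom, using t = r * sqrt(df) / sqrt(1 - r^2). Applying the same identity to the reported Prosper score and credit score values (r = 0.369603, df = 84851) gives approximately the reported t of 115.87.

```r
# Toy paired observations (hypothetical values, not taken from the data set).
prosper_score <- c(2, 4, 5, 6, 7, 8, 9, 10)
borrower_apr  <- c(0.30, 0.33, 0.22, 0.27, 0.18, 0.21, 0.12, 0.15)

ct <- cor.test(prosper_score, borrower_apr, method = "pearson")

# Recompute t from the estimate and the degrees of freedom (n - 2).
r  <- unname(ct$estimate)
df <- unname(ct$parameter)
t_from_r <- r * sqrt(df) / sqrt(1 - r^2)

# Same identity applied to the reported Prosper score vs. credit score test.
t_reported <- 0.369603 * sqrt(84851) / sqrt(1 - 0.369603^2)

print(ct)
print(c(t_from_r = t_from_r, t_reported = t_reported))
```

With a sample this large, even a weak correlation produces a huge t statistic and a tiny p-value, so the size of r is the more informative number here.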

Another measure of interest would be the relationship between the debt-to-income ratio and Prosper score, since it is riskier to grant a loan to a borrower as their amount of existing debt increases, that is, as the borrower becomes more leveraged.

The relationship between debt-to-income ratio and Prosper score isn’t as obvious as expected, but the debt-to-income ratio range extends farther into bad ratios for lower Prosper scores. However, the visualization does show a relationship between the Prosper score and income range. Borrowers with a higher income are more likely to have a higher Prosper score across all debt-to-income ratios.

Does total trades affect the Prosper score?

The total number of trades on the platform doesn’t seem to affect the Prosper score. To rule out the possibility that mixing delinquent trades and good trades together was hiding a trend, coloring based on the percentage of good trades was added. The colors are dispersed fairly evenly, so there really isn’t any trend.

The number of delinquencies should affect the BorrowerAPR and Prosper score, since delinquencies could be a sign that the borrower has a hard time paying back loans.

The expected trend between APR and delinquencies doesn’t seem to show up even when coloring based on previously identified categories. Will Prosper score show the same lack of trend? Or will the expected trend appear?

Another logical influence on the measures of risk (Prosper score and borrower APR) would be how much available credit a borrower has to fulfill the obligation to the Prosper loan. In theory, it’s a liquidity measure, so the more unused, immediately usable credit, the higher the Prosper score and the lower the APR.

The expected trends hold up for Prosper score and APR vs bankcard utilization.

There haven’t been too many trends in the Prosper score, so borrower APR will be the focus moving forward.

Looking at the income range and borrower APR visualization, the APRs of higher income borrowers are concentrated in a narrower area in the lower APR range. The high income borrowers look more likely to take out loans of higher amounts. Lower income borrowers with low APR seem slightly more likely to take out larger loan amounts.

Having an asset like a house would theoretically be good if a borrower is going to take a loan.

The expected relationship between APR and home ownership doesn’t seem to hold true. Is there a relationship between home ownership and income range?

People that make more money seem to be more likely to own a house, but the effect isn’t as strong as expected in the upper incomes.

This visualization shows similar results to the previous ones. Borrowers with higher incomes are more likely to have a better APR and be homeowners. The visualization potentially overstates the strength of the relationship between home ownership and APR.

Despite it popping up in other visualizations, the relationship between Prosper score and income range hasn’t been examined.

The relationship isn’t as clear as expected, but the trend for higher Prosper score for higher income range is there.

Auxiliary Measures

Plots to look at will be selected from a quick grid of relationships.

The Prosper score and estimated return look like they have some sort of relationship, so that will be first.

The visualization is zoomed in where the most points are. There seems to be a correlation, which makes sense. The higher the Prosper score, the less risk, so the APR should be lower to reflect that, which means the return from interest will be lower, in turn. The relationship can be verified by looking at the APR visualizations.

The relationship between return and APR was interesting. There looked to be a trend, but it was also clear that something else was affecting it because there was banding and some trends with different slopes. After splitting the visualization, most of the trends can be seen as owing to term length and Prosper score (risk).

Another variable that seems to have some trends from the grid is employment duration.

In the grid, there seemed to be a clearer positive correlation between employment duration and income range, but on closer inspection, the employment duration seems to peak in the middle income ranges and then reduce back to a slightly higher baseline.

Additionally, there seemed to be a relationship between employment duration and estimated return, but the trend seems to be a blending of APR bands across all employment durations.

Bivariate & Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the data set?

The results of the investigation were mixed: some things trended as expected and others did not. As examples of unexpected results, the credit score had a very weak correlation with the Prosper score, and a history of delinquencies, debt-to-income ratio, and home ownership didn’t seem to affect the Prosper score. Influences that were expected based on theory, such as interest rate (APR) and income range, did show trends relating to Prosper score.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Of the secondary features discussed in the univariate section, the one that showed the best trends and, thus, visualizations was estimated return.

What was the strongest relationship you found?

Prosper score and income range kept popping up, but when visualized individually the relationship wasn’t as strong as expected. The strongest relationship (based on visualization) would probably be Prosper score to bankcard utilization. The relationship owes itself to the bankcard credit line essentially acting as a liquid asset to pay for the Prosper loan payment, making unused bankcard credit a strong indicator of reduced risk.


Final Plots and Summary

Plot One

Description One

The visualization shows the relationship of the proprietary risk score to the borrower’s interest rate (APR) and income range. In theory, a better (higher) Prosper score means less risk and, therefore, a lower borrowing penalty in the form of interest rate. Additionally, a higher cash flow, in the form of income range, should reduce risk and result in a higher Prosper score and lower interest rate. The plot corroborates the theory behind the relationships.

Plot Two

Description Two

The visualization shows the relationship of Prosper score and interest rate to estimated return. The 1 and 5 year loans look like they have sloped lines, with a higher interest rate and thus more return at worse Prosper scores. This aligns with the theory that the interest burden increases with risk and that the interest burden creates the return. The 3 year loans have a piece in the middle that matches the trend of the 1 and 5 year loans, but they also seem to have banding where the return shifts lower and lower as the Prosper score goes down, reflecting the theoretical increase in risk of default for people with a worse proprietary credit score.

Plot Three

Description Three

The visualization shows that the clustering of borrowers with a higher income shifts up and to the left, so higher income borrowers are more likely to have a lower debt-to-income ratio and better Prosper score. These trends align with the theory that, all else being equal, more cash flow will result in less debt.


Reflection

The exploration ended up with different results than I had imagined. There are a lot of variables that theoretically influence risk that only had a weak correlation with the Prosper score, which is the company’s own credit score. Some behaved as expected: more liquidity (lower bankcard utilization) and higher cash flow (income range) both correlated with Prosper score, which makes sense because both would indicate an increased probability of making the monthly payment. However, other indicators of risk didn’t have an effect. Debt-to-income ratio (leverage), which shows the proportion of debt payments to income, should have some correlation, since more debt repayment would indicate more risk that new debt might not get repaid, but the ratio didn’t exhibit any noticeable correlation. A tangible asset like a home would also be something to borrow against, or could show that the borrower has the ability to pay big loans consistently, but home ownership didn’t seem to have any effect on the Prosper score either. Similarly, having a history of delinquent loans didn’t seem to have a dramatic effect, even though it should be a direct indicator of the risk of the loan not getting paid back. Additionally, the widely used credit score did not track well with Prosper score, showing only a weak correlation.

With mixed results from the theoretical predictors, exploration of additional variables would be needed to sufficiently uncover how Prosper determines and evaluates risk. Potential angles include examining loan statuses and defining good and bad loans, because separating categories like late payments (<30 days) from people not being able to pay may paint a different picture than just using “DelinquenciesLast7Years” or “TradesNeverDelinquent..percentage.”. Another thing to check is trends over time for risk measures using the closing date of the loan. Also, comparing the historical Prosper rating on the 7 letter scale with the 1-10 number scale might be worth investigating to see if there’s a difference in evaluation after the switch, which might influence other things like interest rate.
